{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Getting Started"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Reinforcement Learning (RL)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "RL problems can be described as Markov decision processes (MDPs), which can be defined by the tuple $(S, A, T, H, s(0), R)$, where:\n",
    "\n",
    "* $S$ - set of states\n",
    "* $A$ - set of actions (inputs)\n",
    "* $T$ - dynamics model (a probability distribution over next states for each state-action pair)\n",
    "* $H$ - horizon (number of timesteps considered)\n",
    "* $s(0)$ - initial state\n",
    "* $R$ - reward function ($R: S \\times A \\rightarrow \\mathbb{R}$)"
   ]
  },
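  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "As a minimal sketch of how the pieces of the tuple fit together, the cell below rolls out a random policy in a small tabular MDP and sums the reward over the horizon. The two-state, two-action MDP, the array names (`T`, `R`, `s0`) and the helpers `random_policy` and `rollout` are hypothetical and chosen only for illustration; NumPy is assumed to be installed."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import numpy as np\n",
    "\n",
    "# Hypothetical toy MDP (illustration only): 2 states, 2 actions, horizon 10.\n",
    "n_states, n_actions, H = 2, 2, 10\n",
    "# T[s, a] is a probability distribution over next states.\n",
    "T = np.array([[[0.9, 0.1], [0.2, 0.8]],\n",
    "              [[0.5, 0.5], [0.1, 0.9]]])\n",
    "# R[s, a] is the reward for taking action a in state s.\n",
    "R = np.array([[0.0, 1.0],\n",
    "              [2.0, 0.0]])\n",
    "s0 = 0  # initial state s(0)\n",
    "\n",
    "def random_policy(s, rng):\n",
    "    '''Pick an action uniformly at random, ignoring the state.'''\n",
    "    return rng.integers(n_actions)\n",
    "\n",
    "def rollout(policy, rng):\n",
    "    '''Return the total reward collected over H timesteps.'''\n",
    "    s, total = s0, 0.0\n",
    "    for t in range(H):\n",
    "        a = policy(s, rng)\n",
    "        total += R[s, a]\n",
    "        s = rng.choice(n_states, p=T[s, a])  # sample the next state from T\n",
    "    return total\n",
    "\n",
    "rng = np.random.default_rng(0)\n",
    "print(rollout(random_policy, rng))"
   ]
  },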
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Linear Quadratic Regulator (LQR)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Linear quadratic regulators (LQRs) are a class of MDPs whose dynamics model is linear in the state and action:\n",
    "\n",
    "$s(t+1) = A(t)s(t) + B(t)u(t) + w(t)$\n",
    "\n",
    "where:\n",
    "\n",
    "* $t = 0, \\ldots, H$\n",
    "* $A(t) \\in \\mathbb{R}^{n \\times n}$\n",
    "* $B(t) \\in \\mathbb{R}^{n \\times p}$\n",
    "* $w(t)$ is a random variable (zero mean with finite variance)\n",
    "\n",
    "The reward for being in state $s(t)$ and taking action $u(t)$ is:\n",
    "\n",
    "$-s(t)^{T}Q(t)s(t) - u(t)^{T}R(t)u(t)$\n",
    "\n",
    "$Q(t)$ and $R(t)$ parameterize the reward function."
   ]
  },
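  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "One common way to solve a finite-horizon LQR is a backward Riccati recursion, sketched below under simplifying assumptions that are not part of the notes above: the matrices are time-invariant ($A$, $B$, $Q$, $R$ do not depend on $t$), the terminal cost-to-go matrix is set to $Q$, and $R$ is positive definite so the linear solve is well posed. The function name `lqr_gains` and the double-integrator example are hypothetical. Because $w(t)$ is zero-mean and additive, it does not change the feedback gains, so the rollout here simply sets it to zero."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import numpy as np\n",
    "\n",
    "def lqr_gains(A, B, Q, R, H):\n",
    "    '''Backward Riccati recursion for a time-invariant, finite-horizon LQR.\n",
    "\n",
    "    Returns a list of gains K[0..H-1] such that u(t) = K[t] @ s(t).\n",
    "    '''\n",
    "    P = Q.copy()  # assumed terminal cost-to-go matrix\n",
    "    gains = []\n",
    "    for _ in range(H):\n",
    "        BtP = B.T @ P\n",
    "        # K = -(R + B^T P B)^{-1} B^T P A\n",
    "        K = -np.linalg.solve(R + BtP @ B, BtP @ A)\n",
    "        # P = Q + A^T P (A + B K)\n",
    "        P = Q + A.T @ P @ (A + B @ K)\n",
    "        gains.append(K)\n",
    "    return gains[::-1]  # reorder so gains[t] applies at time t\n",
    "\n",
    "# Hypothetical example: a discrete-time double integrator.\n",
    "A = np.array([[1.0, 1.0], [0.0, 1.0]])\n",
    "B = np.array([[0.0], [1.0]])\n",
    "Q = np.eye(2)\n",
    "R = np.array([[1.0]])\n",
    "gains = lqr_gains(A, B, Q, R, H=20)\n",
    "\n",
    "s = np.array([1.0, 0.0])\n",
    "for K in gains:\n",
    "    s = A @ s + B @ (K @ s)  # noise-free rollout, w(t) = 0\n",
    "print(np.linalg.norm(s))  # distance of the final state from the origin"
   ]
  },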
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Differential Dynamic Programming (DDP)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "DDP can approximately solve continuous state-space MDPs by iterating over two steps:\n",
    "\n",
    "1. Compute a linear approximation to the dynamics and a quadratic approximation to the reward function around the trajectory obtained when using the current policy.\n",
    "2. Compute the optimal policy for the LQR problem obtained in Step 1, and set the current policy equal to it.\n",
    "\n",
    "So, like all dynamic programming, it's a bootstrapping loop: approximate the dynamics and reward around the current trajectory, then use that approximation to update the policy."
   ]
  }
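  ,
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The linearization half of the first step can be sketched with finite differences. The cell below is only an illustration under stated assumptions: the helper `linearize`, the step size `eps`, and the pendulum dynamics `pendulum_step` are all hypothetical, and the quadratic approximation of the reward is not shown. It returns matrices playing the role of $A(t)$ and $B(t)$ at one point on the current trajectory."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import numpy as np\n",
    "\n",
    "def linearize(f, s, u, eps=1e-5):\n",
    "    '''Central-difference Jacobians of f(s, u) with respect to s and u.'''\n",
    "    n, p = s.size, u.size\n",
    "    A = np.zeros((n, n))\n",
    "    B = np.zeros((n, p))\n",
    "    for i in range(n):\n",
    "        ds = np.zeros(n)\n",
    "        ds[i] = eps\n",
    "        A[:, i] = (f(s + ds, u) - f(s - ds, u)) / (2 * eps)\n",
    "    for j in range(p):\n",
    "        du = np.zeros(p)\n",
    "        du[j] = eps\n",
    "        B[:, j] = (f(s, u + du) - f(s, u - du)) / (2 * eps)\n",
    "    return A, B\n",
    "\n",
    "# Hypothetical pendulum dynamics: state is (angle, angular velocity), input is a torque.\n",
    "def pendulum_step(s, u, dt=0.05):\n",
    "    theta, omega = s\n",
    "    return np.array([theta + dt * omega,\n",
    "                     omega + dt * (-9.81 * np.sin(theta) + u[0])])\n",
    "\n",
    "# Linearize around one nominal state-action pair on the current trajectory.\n",
    "A, B = linearize(pendulum_step, np.array([0.1, 0.0]), np.array([0.0]))\n",
    "print(A)\n",
    "print(B)"
   ]
  }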
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.5.2"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}
